[fix] avoid concurrent tablet stat iteration failures#63298

Open
yx-keith wants to merge 3 commits into
apache:masterfrom
yx-keith:fix-tablet-stat-concurrency

@yx-keith
Contributor

What problem does this PR solve?

Issue Number: #59138

Related PR: #xxx

Problem Summary:
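TabletStatMgr can hit concurrency failures when FE metadata changes while tablet stats are being collected. There are two causes. First, MaterializedIndex.getTablets() and LocalTablet.getReplicas() return their internal mutable lists directly, so a caller may be iterating a list that another thread is modifying. Second, updateTabletStat() has a stale-metadata window between its calls to getTabletMeta() and getReplica(), during which the tablet metadata can be removed.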

Solution:
This PR makes the tablet stat read path more robust:

  • Return snapshot lists from MaterializedIndex.getTablets() and LocalTablet.getReplicas().
  • Skip stale stat updates in TabletStatMgr.updateTabletStat() when tablet metadata is removed concurrently.
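
For readers who want the two ideas in isolation, here is a minimal sketch and not the actual diff. The names `TabletStatSketch`, `Index`, and `tabletMeta` are hypothetical stand-ins for the FE classes mentioned above, and the use of `synchronized` is an assumption made only to keep the example self-contained; the real code may guard its lists differently.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-ins for the FE classes named in this PR; not the real Doris code.
public class TabletStatSketch {
    static class Index {
        private final List<Long> tablets = new ArrayList<>();

        synchronized void addTablet(long id) {
            tablets.add(id);
        }

        synchronized void removeTablet(long id) {
            tablets.remove(Long.valueOf(id));
        }

        // Return a copy, so a caller can keep iterating while a writer
        // mutates the internal list.
        synchronized List<Long> getTablets() {
            return new ArrayList<>(tablets);
        }
    }

    // Hypothetical metadata map: tablet id -> metadata.
    static final Map<Long, String> tabletMeta = new ConcurrentHashMap<>();

    // Returns false (skips the update) when the metadata is already gone.
    static boolean updateTabletStat(long tabletId) {
        String meta = tabletMeta.get(tabletId);
        if (meta == null) {
            return false; // stale stat report: tablet was dropped concurrently
        }
        // ... apply the reported stat to the replica here ...
        return true;
    }

    public static void main(String[] args) {
        Index index = new Index();
        index.addTablet(1L);
        index.addTablet(2L);
        tabletMeta.put(1L, "meta-1");
        tabletMeta.put(2L, "meta-2");

        // Simulate tablet 2 being dropped while the stat loop is running.
        for (long id : index.getTablets()) {
            if (id == 1L) {
                index.removeTablet(2L);
                tabletMeta.remove(2L);
            }
            System.out.println("tablet " + id + " updated=" + updateTabletStat(id));
        }
    }
}
```

Because the loop walks a copy, removing a tablet mid-iteration does not disturb the iterator, and the null check turns the late update for the dropped tablet into a no-op instead of a failure. The trade-off is one list allocation per call.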

Release note

None

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

@hello-stephen
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@yx-keith
Contributor Author

run buildall

@yx-keith
Contributor Author

run buildall

@hello-stephen
Contributor

TPC-H: Total hot run time: 30762 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit be901132f6069d74136c6c887890d4831902fc1f, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	17684	3852	3886	3852
q2	q3	10794	1362	839	839
q4	4680	465	348	348
q5	7585	2254	2109	2109
q6	323	171	142	142
q7	942	761	640	640
q8	9454	1679	1601	1601
q9	6740	4943	4899	4899
q10	6445	2108	1803	1803
q11	426	282	236	236
q12	685	422	290	290
q13	18228	3381	2755	2755
q14	257	257	232	232
q15	q16	825	781	705	705
q17	947	958	927	927
q18	6750	5633	5453	5453
q19	1204	1219	1038	1038
q20	524	398	259	259
q21	5630	2617	2327	2327
q22	432	358	307	307
Total cold run time: 100555 ms
Total hot run time: 30762 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	4267	4195	4164	4164
q2	q3	4432	4897	4353	4353
q4	2158	2189	1375	1375
q5	4416	4241	4284	4241
q6	224	173	127	127
q7	2090	1899	1611	1611
q8	2633	2153	2143	2143
q9	7781	7702	7721	7702
q10	4570	4477	4082	4082
q11	585	422	370	370
q12	894	750	515	515
q13	3286	3622	3039	3039
q14	301	301	277	277
q15	q16	720	727	651	651
q17	1381	1340	1377	1340
q18	7904	7359	7085	7085
q19	1091	1099	1071	1071
q20	2213	2199	1934	1934
q21	5343	4619	4507	4507
q22	531	459	400	400
Total cold run time: 56820 ms
Total hot run time: 50987 ms

@hello-stephen
Contributor

TPC-DS: Total hot run time: 167902 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit be901132f6069d74136c6c887890d4831902fc1f, data reload: false

query5	4350	651	509	509
query6	342	227	195	195
query7	4300	536	306	306
query8	319	226	218	218
query9	8800	3992	3995	3992
query10	448	332	305	305
query11	5794	2349	2212	2212
query12	180	131	128	128
query13	1311	616	404	404
query14	5900	5342	5043	5043
query14_1	4373	4345	4322	4322
query15	209	208	185	185
query16	989	458	472	458
query17	1161	742	612	612
query18	2512	482	368	368
query19	222	212	170	170
query20	146	135	134	134
query21	222	143	124	124
query22	13636	13525	13451	13451
query23	17127	16371	15996	15996
query23_1	16213	16078	16194	16078
query24	7408	1743	1284	1284
query24_1	1281	1293	1292	1292
query25	539	460	404	404
query26	1306	317	173	173
query27	2703	551	357	357
query28	4431	1964	1935	1935
query29	1001	623	496	496
query30	306	239	200	200
query31	1115	1049	933	933
query32	86	73	74	73
query33	530	354	292	292
query34	1157	1111	639	639
query35	774	769	679	679
query36	1340	1304	1204	1204
query37	164	106	88	88
query38	3212	3131	3038	3038
query39	921	916	903	903
query39_1	871	894	870	870
query40	234	148	125	125
query41	66	66	63	63
query42	108	109	116	109
query43	327	321	284	284
query44	
query45	208	204	189	189
query46	1096	1157	711	711
query47	2329	2332	2167	2167
query48	406	395	297	297
query49	629	489	365	365
query50	956	334	256	256
query51	4302	4293	4238	4238
query52	105	104	94	94
query53	252	286	207	207
query54	314	265	245	245
query55	89	86	86	86
query56	284	292	303	292
query57	1433	1470	1379	1379
query58	327	275	269	269
query59	1601	1723	1501	1501
query60	320	330	315	315
query61	162	154	151	151
query62	669	627	576	576
query63	241	204	211	204
query64	2415	813	621	621
query65	
query66	1742	488	366	366
query67	30252	30162	29001	29001
query68	
query69	455	334	299	299
query70	1015	949	995	949
query71	307	273	267	267
query72	2900	2684	2367	2367
query73	883	737	395	395
query74	5064	4945	4757	4757
query75	2681	2595	2249	2249
query76	2310	1134	773	773
query77	395	414	355	355
query78	12201	12121	11639	11639
query79	1381	1037	718	718
query80	656	581	476	476
query81	467	280	248	248
query82	427	156	124	124
query83	357	277	257	257
query84	268	144	110	110
query85	955	629	450	450
query86	407	345	313	313
query87	3378	3322	3238	3238
query88	3464	2664	2643	2643
query89	432	397	334	334
query90	1936	172	183	172
query91	176	173	143	143
query92	80	78	73	73
query93	1605	1475	881	881
query94	531	352	324	324
query95	681	387	341	341
query96	965	821	325	325
query97	2685	2687	2566	2566
query98	238	224	239	224
query99	1120	1072	945	945
Total cold run time: 251899 ms
Total hot run time: 167902 ms

@hello-stephen
Contributor

FE UT Coverage Report

Increment line coverage 14.29% (2/14) 🎉
Increment coverage report
Complete coverage report

@yx-keith
Contributor Author

run p0

@yx-keith
Contributor Author

run buildall

@hello-stephen
Contributor

TPC-H: Total hot run time: 31516 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit be901132f6069d74136c6c887890d4831902fc1f, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	17807	3974	3976	3974
q2	q3	10830	1375	811	811
q4	4692	479	353	353
q5	7601	2296	2111	2111
q6	233	181	145	145
q7	985	784	633	633
q8	9404	1648	1605	1605
q9	5183	4941	4979	4941
q10	6403	2073	1784	1784
q11	443	276	243	243
q12	627	421	292	292
q13	18115	3388	2807	2807
q14	267	256	261	256
q15	q16	825	772	711	711
q17	1006	974	915	915
q18	7064	5624	5481	5481
q19	1297	1326	1063	1063
q20	515	543	308	308
q21	6340	2885	2748	2748
q22	476	385	335	335
Total cold run time: 100113 ms
Total hot run time: 31516 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	5041	4642	4657	4642
q2	q3	4851	5223	4613	4613
q4	2123	2202	1409	1409
q5	4863	4651	4629	4629
q6	233	179	134	134
q7	1955	1727	1492	1492
q8	2466	2163	2107	2107
q9	7740	7521	7219	7219
q10	4463	4401	3975	3975
q11	539	378	369	369
q12	708	716	527	527
q13	3005	3411	2799	2799
q14	267	268	258	258
q15	q16	690	702	626	626
q17	1274	1252	1245	1245
q18	7282	6692	6688	6688
q19	1110	1104	1095	1095
q20	2218	2220	1942	1942
q21	5335	4647	4518	4518
q22	522	456	414	414
Total cold run time: 56685 ms
Total hot run time: 50701 ms

@hello-stephen
Contributor

TPC-DS: Total hot run time: 169444 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit be901132f6069d74136c6c887890d4831902fc1f, data reload: false

query5	4325	666	521	521
query6	343	211	193	193
query7	4231	576	308	308
query8	336	243	246	243
query9	8849	4038	4045	4038
query10	448	332	299	299
query11	5848	2401	2226	2226
query12	179	131	127	127
query13	1308	619	438	438
query14	6052	5348	5071	5071
query14_1	4349	4337	4354	4337
query15	218	204	188	188
query16	1039	451	445	445
query17	1176	730	642	642
query18	2806	491	361	361
query19	224	211	188	188
query20	143	136	135	135
query21	226	144	120	120
query22	13705	13523	13400	13400
query23	17151	16338	16101	16101
query23_1	16172	16153	16262	16153
query24	7514	1792	1308	1308
query24_1	1314	1326	1294	1294
query25	577	512	443	443
query26	1322	317	176	176
query27	2730	571	349	349
query28	4379	1970	1965	1965
query29	1010	643	524	524
query30	304	230	202	202
query31	1124	1068	941	941
query32	90	81	78	78
query33	561	405	298	298
query34	1158	1143	637	637
query35	774	773	680	680
query36	1286	1379	1166	1166
query37	159	108	93	93
query38	3220	3151	3062	3062
query39	931	924	891	891
query39_1	887	895	883	883
query40	239	144	127	127
query41	71	67	63	63
query42	116	116	113	113
query43	324	329	282	282
query44	
query45	213	199	194	194
query46	1057	1202	721	721
query47	2320	2350	2161	2161
query48	376	429	298	298
query49	637	489	381	381
query50	956	356	252	252
query51	4274	4372	4176	4176
query52	105	106	94	94
query53	254	285	207	207
query54	316	267	258	258
query55	94	90	85	85
query56	298	309	307	307
query57	1398	1379	1268	1268
query58	301	282	271	271
query59	1561	1600	1444	1444
query60	322	324	319	319
query61	155	160	157	157
query62	669	612	558	558
query63	241	202	203	202
query64	2370	780	634	634
query65	
query66	1643	486	396	396
query67	30062	29933	29713	29713
query68	
query69	457	349	311	311
query70	1001	999	992	992
query71	319	276	283	276
query72	3039	2736	2433	2433
query73	879	803	422	422
query74	5100	4913	4718	4718
query75	2683	2623	2253	2253
query76	2279	1152	791	791
query77	411	409	341	341
query78	12134	12110	11647	11647
query79	1463	1085	741	741
query80	1250	554	471	471
query81	518	280	242	242
query82	1381	162	127	127
query83	349	281	251	251
query84	258	136	113	113
query85	967	532	459	459
query86	429	327	346	327
query87	3467	3362	3195	3195
query88	3605	2697	2640	2640
query89	458	392	336	336
query90	1823	184	178	178
query91	183	175	145	145
query92	79	77	70	70
query93	1469	1477	823	823
query94	636	369	309	309
query95	682	478	350	350
query96	990	742	319	319
query97	2685	2729	2566	2566
query98	240	239	237	237
query99	1117	1120	1005	1005
Total cold run time: 254349 ms
Total hot run time: 169444 ms

@morningman
Contributor

/review
